In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
sns.set_context("talk")
sns.set_style("ticks")
np.random.seed(1337)
I will be writing about Extreme Value Theory (EVT), which was introduced to me by my brother Sudhanshu while he was working on his internship project. I really liked the connection it has with the central limit theorem (CLT). The approach allowed me to better understand the CLT as a way to identify the distribution of a function applied to independent and identically distributed (i.i.d.) samples: for the CLT, the function is the mean (or sum) of the samples; for EVT, it is the max or min.
I will do a multi-part series trying to explain EVT using examples from real data. In this first part, the focus is simply on understanding the core ideas and introducing the limiting distributions of EVT.
Most of us know about the central limit theorem. It states that the sample means of i.i.d. samples from a distribution with finite variance follow a normal distribution. More formally, we can write it as follows:
$$ \begin{equation} X_1, X_2, ..., X_n \sim p(X)\\ S_n = \frac{\sum_i X_i}{n}\\ S_n \sim \mathcal{N}(\mu, \sigma^2/n)\\ \end{equation} $$(approximately, for large $n$). Similarly, there exists Extreme Value Theory. This theory deals with the distribution of the extreme values (like the minimum or maximum) of i.i.d. samples from some distribution.
$$ \begin{equation} X_1, X_2, ..., X_n \sim p(X)\\ M_n = \max_i X_i\\ M_n \sim {\textrm {GEV}}(\mu ,\,\sigma ,\,\xi ) \\ \end{equation} $$Surprisingly, just like in the central limit theorem, these values also converge (after suitable rescaling) to a single family of distributions, the generalized extreme value (GEV) distribution, which comprises three types (a short derivation follows the list below). These distributions are:
Weibull law: $G(z)={\begin{cases}\exp \left\{-\left(-\left({\frac {z-b}{a}}\right)\right)^{\alpha }\right\}&z<b\\1&z\geq b\end{cases}}$ when the distribution of $M_{n}$ has a light tail with finite upper bound. Also known as Type 3.
Gumbel law: $G(z)=\exp \left\{-\exp \left(-\left({\frac {z-b}{a}}\right)\right)\right\}{\text{ for }}z\in \mathbb {R}$, when the distribution of $M_{n}$ has an exponential tail. Also known as Type 1.
Fréchet Law: $G(z)={\begin{cases}0&z\leq b\\\exp \left\{-\left({\frac {z-b}{a}}\right)^{-\alpha }\right\}&z>b.\end{cases}}$ when the distribution of $M_{n}$ has a heavy tail (including polynomial decay). Also known as Type 2.
In all cases, $\alpha >0$. [Source: https://en.wikipedia.org/wiki/Extreme_value_theory#Univariate_theory ]
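To see where these limiting laws come from, note that for i.i.d. samples the CDF of the maximum factorizes: $P(M_n \leq z) = P(X_1 \leq z, \dots, X_n \leq z) = F(z)^n$. As a concrete example of an exponential tail, take $X_i \sim \textrm{Exponential}(1)$ with $F(z) = 1 - e^{-z}$ and shift the maximum by $\log n$; the limit is exactly the Gumbel law:
$$ \begin{equation} P(M_n - \log n \leq z) = F(z + \log n)^n = \left(1 - \frac{e^{-z}}{n}\right)^{n} \xrightarrow{n\to\infty} \exp\left(-e^{-z}\right) \end{equation} $$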
The good thing about knowing the distribution of $M_n$ is that we can quantify the probability of observing a value as extreme as $M_n$. In the case of the maximum, this can be computed using the cumulative distribution function (CDF) as $P(M_n < x)$. If we set a threshold on this probability (say $\delta = 99\%$), then we can build systems which are robust to the extreme value $\delta$ percent of the time, and we also know that any observation whose CDF value exceeds $\delta$ is a very rare event.
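As a minimal sketch of this thresholding idea (the GEV parameters below are made up for illustration; in practice they would be fitted to observed maxima, as we do later), we can use scipy.stats.genextreme both ways: the CDF tells us how unusual an observed maximum is, and the percent point function (inverse CDF) gives the threshold that bounds $\delta$ of the maxima.
In [ ]:
# Hypothetical Gumbel-type GEV (c=0) with illustrative loc/scale values.
gev = stats.genextreme(c=0, loc=10, scale=2)
m_obs = 18.0
print(f"P(M_n < {m_obs}) = {gev.cdf(m_obs):.4f}")  # how unusual is this maximum?
delta = 0.99
print(f"{delta:.0%} of maxima fall below {gev.ppf(delta):.2f}")  # design threshold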
Let us see this in action. First we are going to create a draw_samples function which allows us to draw samples from various distributions available in scipy.stats. The function returns an $n \times k$ array: $n$ independent samples, each consisting of $k$ draws, and the aggregations below are applied across each sample of size $k$.
In [68]:
def draw_samples(dist, n, k=1):
    # Draw an (n, k) array: n independent samples, each consisting of k draws.
    return dist.rvs(size=(n,k))
def plot_samples(dists, aggregations, n=1000, k_max=10000):
    # For each distribution, plot the histogram (and cumulative histogram) of each
    # aggregation (e.g. mean/max/min) computed over the first k draws, for growing k.
    rows, cols = len(aggregations), len(dists)
    fig, ax = plt.subplots(rows, cols, figsize=(5*cols, 3*rows))
    fig_cum, ax_cum = plt.subplots(rows, cols, figsize=(5*cols, 3*rows))
    for i, (dist_name, dist) in enumerate(dists):
        samples = draw_samples(dist, n, k_max)
        dist_mean = dist.mean()
        dist_std = dist.std()
        for j, (agg_name, agg) in enumerate(aggregations):
            for k in [int(10**p) for p in np.arange(1, np.log10(k_max)+1)]:
                normed_samples = samples[:, :k]
                #normed_samples = (samples[:, :k] - dist_mean)/dist_std
                X = agg(normed_samples, axis=1)
                #print(dist_name, agg_name, k, X.shape)
                ax[j, i].hist(
                    X, bins=100, weights=np.ones_like(X)/X.shape[0],
                    cumulative=False,
                    histtype="step",
                    label=f"$k={k}$"
                )
                ax_cum[j, i].hist(
                    X, bins=100, weights=np.ones_like(X)/X.shape[0],
                    cumulative=True,
                    histtype="step",
                    label=f"$k={k}$"
                )
            if j == 0:
                ax[j, i].set_title(dist_name)
                ax_cum[j, i].set_title(dist_name)
            if i == 0:
                ax[j, i].set_ylabel(f"$p({agg_name}[X_k])$")
                ax_cum[j, i].set_ylabel(f"$p({agg_name}[X_k] < x)$")
            if j == rows-1 and i == cols-1:
                ax[j, i].legend(loc="best", bbox_to_anchor=(1.01, 0.5))
                ax_cum[j, i].legend(loc="best", bbox_to_anchor=(1.01, 0.5))
    sns.despine(fig=fig, offset=10)
    sns.despine(fig=fig_cum, offset=10)
    fig.tight_layout()
    fig_cum.tight_layout()
    return (fig, ax), (fig_cum, ax_cum)
In [69]:
samples = draw_samples(stats.norm(), 100, 10000)
samples.shape
Out[69]:
(100, 10000)
Let us compare the distribution of the mean, max, and min of samples from a few distributions, e.g. uniform, normal, exponential, logistic, and powerlaw. We are going to plot the PDF as well as the CDF of these distributions. The PDF tells us the neighborhood of high probability density, while the CDF gives us a way to bound a certain percentage of the data.
In [71]:
dists = [
    ("uniform", stats.uniform()),
    ("norm", stats.norm()),
    ("expon", stats.expon()),
    ("logistic", stats.logistic()),
    ("powerlaw", stats.powerlaw(a=5)),
]
aggregations = [
    ("mean", np.mean),
    ("max", np.max),
    ("min", np.min),
]
plot_samples(dists, aggregations, n=1000, k_max=10000);
As we can observe, as the number of samples $k$ increases, the estimates of the mean and of the extreme values converge. However, unlike the central limit theorem, where the distribution of the estimated mean stays centred on the true mean and only its variance shrinks for higher $k$, the extreme value distributions shift to the right (for max) or to the left (for min) as $k$ increases. The variance of the extreme value distributions also reduces as $k$ increases.
For distributions with a bounded support, like the uniform and powerlaw, the maximum converges to the upper bound as $k$ increases; likewise, for distributions bounded below (uniform, exponential, powerlaw) the minimum converges to the lower bound.
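As a quick numeric check of this behaviour, compare the maximum of $k$ draws from the (bounded) uniform distribution with the maximum of $k$ draws from the (unbounded) standard normal: the former creeps up to the bound $1$, while the latter keeps growing slowly.
In [ ]:
# Max of k draws: bounded support (uniform on [0, 1]) vs unbounded support (standard normal).
for k in [10, 100, 1000, 10000]:
    u_max = stats.uniform().rvs(size=k).max()
    n_max = stats.norm().rvs(size=k).max()
    print(f"k={k:>5}  max(uniform)={u_max:.4f}  max(norm)={n_max:.4f}")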
In [3]:
normal = stats.norm()
# scipy's genextreme uses a shape parameter c with the opposite sign of xi:
# c=0 is Gumbel, c<0 is Frechet (heavy tail), c>0 is (reversed) Weibull (bounded above).
gumbel = stats.genextreme(c=0)
frechet = stats.genextreme(c=-1)
weibull = stats.genextreme(c=1)
In [4]:
fig, ax = plt.subplots(1,2, sharex=True, sharey=True, figsize=(15, 6))
x = np.arange(-3, 5.001, 0.001)
ax[0].plot(x, normal.pdf(x), label="normal", lw=1)
ax[0].plot(x, gumbel.pdf(x), label="gumbel", lw=1)
ax[0].plot(x, frechet.pdf(x), label="frechet", lw=1)
ax[0].plot(x, weibull.pdf(x), label="weibull", lw=1)
ax[0].axvline(x=0, lw=0.5, linestyle="--", color="k")
ax[0].set_ylabel("$P(X=x)$")
ax[0].set_xlabel("$x$")
ax[1].plot(x, normal.cdf(x), label="normal", lw=1)
ax[1].plot(x, gumbel.cdf(x), label="gumbel", lw=1)
ax[1].plot(x, frechet.cdf(x), label="frechet", lw=1)
ax[1].plot(x, weibull.cdf(x), label="weibull", lw=1)
ax[1].axvline(x=0, lw=0.5, linestyle="--", color="k")
ax[1].axhline(y=0.9, lw=0.5, linestyle="--", color="k")
ax[1].set_ylabel(r"$P(X\leq x)$")
ax[1].set_xlabel("$x$")
lgd = fig.legend(
    *ax[1].get_legend_handles_labels(),
    bbox_to_anchor=(0.8, 1.05),
    bbox_transform=fig.transFigure,
    ncol=4
)
for lh in lgd.legendHandles:
    lh.set_linewidth(5)
sns.despine(offset=10)
fig.tight_layout()
In [5]:
iris = sns.load_dataset("iris")
iris.head()
Out[5]:
In [6]:
g = sns.pairplot(iris, hue="species", markers=["o", "s", "D"])
In [7]:
g = sns.pairplot(iris[["petal_length", "petal_width", "species"]], hue="species", markers=["o", "s", "D"])
In [8]:
X = iris[["petal_length", "petal_width"]]
y = iris["species"]
In [9]:
def get_known_mean(X, y, known_classes):
    # Mean feature vector of each "known" class.
    known_idx = y.isin(known_classes)
    known_mean = X[known_idx].groupby(y[known_idx]).mean()
    return known_mean

def plot_known_mean(X, y, known_mean):
    # Scatter the data coloured by class and mark the known class means with stars.
    ax = sns.scatterplot(data=X, x="petal_length", y="petal_width", hue=y, alpha=0.5)
    ax.plot(known_mean["petal_length"], known_mean["petal_width"], marker="*", linestyle="none", ms=20, color="k", alpha=0.9)
    sns.despine(offset=10)
    return ax
In [10]:
known_classes = ["setosa", "versicolor"]
known_mean = get_known_mean(X, y, known_classes)
known_mean
Out[10]:
In [11]:
plot_known_mean(X, y, known_mean)
Out[11]:
In [12]:
def get_errors(X, y, known_mean):
    # For each class, squared distance of every point to its *nearest* known class mean.
    errors = {
        k: pd.concat([
            np.square(X[y==k] - known_mean.loc[cls]).sum(axis=1)
            for cls in known_mean.index
        ], axis=1).min(axis=1)
        for k in y.unique()
    }
    return errors

def get_error_bins(errors, nbins=40):
    # Common histogram bins spanning the pooled errors of all classes.
    tmp = np.hstack(list(errors.values()))
    print(tmp.shape)
    bins = np.histogram(tmp, bins=nbins)[1]
    print(bins.min(), bins.max())
    return bins
In [13]:
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)
In [14]:
errors["setosa"].shape
Out[14]:
(50,)
In [15]:
def fit_distributions(errors, known_mean):
    # Fit a GEV and a normal distribution to the errors of the known classes only.
    known_errors = pd.concat([errors[cls] for cls in known_mean.index])
    c, loc, scale = stats.genextreme.fit(
        known_errors,
        loc=known_errors.mean(),
        scale=known_errors.std()
    )
    print(f"c={c}, loc={loc}, scale={scale}")
    extreme_dist = stats.genextreme(c, loc, scale)
    print(extreme_dist.stats("mvsk"))
    loc, scale = stats.norm.fit(
        known_errors,
    )
    print(f"loc={loc}, scale={scale}")
    norm_dist = stats.norm(loc, scale)
    print(norm_dist.stats("mvsk"))
    return extreme_dist, norm_dist
In [16]:
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
In [17]:
def plot_distributions(extreme_dist, norm_dist, bins):
    # Compare the fitted normal and extreme value distributions over the error range.
    fig, ax = plt.subplots(1, 2, sharex=True, sharey=False, figsize=(8, 3))
    lower = bins[0]
    #higher = max([extreme_dist.mean() + 2*extreme_dist.std(), norm_dist.mean() + 2*norm_dist.std()])
    higher = bins[-1]
    x = np.linspace(lower, higher, 1000)
    ax[0].plot(x, norm_dist.pdf(x), label="normal", lw=10, alpha=0.5)
    ax[0].plot(x, extreme_dist.pdf(x), label="extreme", lw=1)
    #ax[0].axvline(x=0, lw=0.5, linestyle="--", color="k")
    ax[0].set_ylabel("$P(X=x)$")
    ax[0].set_xlabel("$x$")
    ax[1].plot(x, norm_dist.cdf(x), label="normal", lw=10, alpha=0.5)
    ax[1].plot(x, extreme_dist.cdf(x), label="extreme", lw=1)
    #ax[1].axvline(x=0, lw=0.5, linestyle="--", color="k")
    ax[1].axhline(y=0.9, lw=0.5, linestyle="--", color="k")
    ax[1].set_ylabel(r"$P(X\leq x)$")
    ax[1].set_xlabel("$x$")
    lgd = fig.legend(
        *ax[1].get_legend_handles_labels(),
        bbox_to_anchor=(0.8, 1.05),
        bbox_transform=fig.transFigure,
        ncol=4
    )
    for lh in lgd.legendHandles:
        lh.set_linewidth(5)
    sns.despine(offset=10)
    fig.tight_layout()
In [18]:
plot_distributions(extreme_dist, norm_dist, bins)
In [19]:
def plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins):
    # Empirical CDF of the errors per class, overlaid with the fitted GEV and normal CDFs.
    # Note: relies on the global `y` for the list of classes.
    all_classes = y.unique()
    ncols = min(4, len(all_classes))
    nrows = int(np.ceil(len(all_classes)/ncols))
    fig, ax = plt.subplots(
        nrows, ncols, sharex=True, sharey=True,
        figsize=(5*ncols, 4*nrows)
    )
    x = np.arange(bins[0], bins[-1]+0.01, 0.01)
    threshold_cdf = 0.9
    threshold_x = extreme_dist.ppf(threshold_cdf)
    for i, (axi, cls) in enumerate(zip(ax.flatten(), all_classes)):
        x_err = errors[cls]
        color = "red"
        if cls in known_classes:
            color = "green"
        axi.hist(
            x_err, bins, cumulative=True, density=True,
            #weights=np.ones_like(errors)/np.shape(errors)[0],
            edgecolor=color, linewidth=10, histtype="step", alpha=0.5
        )
        axi.plot(x, extreme_dist.cdf(x), lw=1, color="k", label="extreme")
        axi.plot(x, norm_dist.cdf(x), lw=10, color="b", alpha=0.1, label="normal")
        axi.axhline(y=threshold_cdf, lw=1, linestyle="--", color="0.5")
        axi.axvline(x=threshold_x, lw=1, linestyle="--", color="0.5")
        axi.set_title(f"Mean[{cls}]={np.mean(x_err):.4f}", fontsize=20)
        axi.legend(loc="best")
    sns.despine(offset=10)
    fig.tight_layout()
In [20]:
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
In [21]:
def plot_cdf_cdf(errors, known_classes, extreme_dist):
    # Histogram of the GEV CDF values of the errors per class: known classes (green)
    # should stay well below 1, unknown classes (red) pile up near 1, i.e. rare events.
    all_classes = y.unique()
    ncols = min(4, len(all_classes))
    nrows = int(np.ceil(len(all_classes)/ncols))
    fig, ax = plt.subplots(
        nrows, ncols, sharex=True, sharey=True,
        figsize=(5*ncols, 4*nrows)
    )
    bins = np.arange(0, 1.01, 0.01)
    for i, (axi, cls) in enumerate(zip(ax.flatten(), all_classes)):
        x_err = errors[cls]
        color = "red"
        if cls in known_classes:
            color = "green"
        errors_cdf = extreme_dist.cdf(x_err)
        axi.hist(
            errors_cdf, bins, cumulative=True, density=True,
            edgecolor=color, linewidth=10, histtype="step", alpha=0.5
        )
        axi.set_title(f"Mean[{cls}]={np.mean(errors_cdf):.4f}", fontsize=20)
    sns.despine(offset=10)
    fig.tight_layout()
In [22]:
plot_cdf_cdf(errors, known_classes, extreme_dist)
In [23]:
def scatter_known_unknown(X, y, errors, extreme_dist):
    # Colour each point by the GEV CDF of its error: values close to 1 flag likely "unknown" samples.
    fig = plt.figure()
    known_unknown = pd.concat(errors.values()).apply(extreme_dist.cdf)
    print((known_unknown > 0.95).value_counts())
    points = plt.scatter(
        data=X,
        x="petal_length",
        y="petal_width",
        c=known_unknown,
        s=40,
        cmap="viridis_r"
    )
    plt.colorbar(points)
    sns.despine(offset=10)

def scatter_error(X, y, errors):
    # Colour each point by its raw squared error to the nearest known class mean.
    fig = plt.figure()
    points = plt.scatter(
        data=X,
        x="petal_length",
        y="petal_width",
        c=pd.concat(errors.values()),
        s=40,
        cmap="viridis_r"
    )
    plt.colorbar(points)
    sns.despine(offset=10)
In [24]:
scatter_known_unknown(X, y, errors, extreme_dist)
scatter_error(X, y, errors)
In [25]:
known_classes = ["setosa", "virginica"]
known_mean = get_known_mean(X, y, known_classes)
known_mean
Out[25]:
In [26]:
plot_known_mean(X, y, known_mean)
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
plot_distributions(extreme_dist, norm_dist, bins)
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
plot_cdf_cdf(errors, known_classes, extreme_dist)
scatter_known_unknown(X, y, errors, extreme_dist)
scatter_error(X, y, errors)
In [27]:
X = iris[["petal_length", "petal_width", "sepal_length", "sepal_width"]]
y = iris["species"]
In [28]:
known_classes = ["setosa", "virginica"]
known_mean = get_known_mean(X, y, known_classes)
known_mean
Out[28]:
In [29]:
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
plot_distributions(extreme_dist, norm_dist, bins)
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
plot_cdf_cdf(errors, known_classes, extreme_dist)
In [30]:
from sklearn import datasets
In [31]:
faces = datasets.fetch_olivetti_faces()
In [32]:
print(faces.DESCR)
In [33]:
faces.data.shape
Out[33]:
(400, 4096)
In [34]:
faces.target.shape
Out[34]:
(400,)
In [35]:
faces = pd.DataFrame(faces.data, columns=[f"f{i}" for i in range(faces.data.shape[1])]).assign(label=faces.target)
faces.head()
Out[35]:
In [36]:
X = faces[faces.label < 16].filter(regex="^f")
y = faces[faces.label < 16]["label"]
X.shape, y.shape, y.unique().shape
Out[36]:
((160, 4096), (160,), (16,))
In [37]:
known_classes = list(range(10))
known_mean = get_known_mean(X, y, known_classes)
known_mean
Out[37]:
In [38]:
%time errors = get_errors(X, y, known_mean)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
In [39]:
bins = get_error_bins(errors, nbins=40)
%time plot_distributions(extreme_dist, norm_dist, bins)
%time plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
%time plot_cdf_cdf(errors, known_classes, extreme_dist)
In [ ]: